ILLUMINATING THE WEB
Smarter software and know-how are mining the nooks and crannies conventional tools can't reach
When he first postulated the World Wide Web two decades ago, Tim Berners-Lee dreamed that "all the bits of information in every computer ... on the planet would be available to me and to anyone else. There would be a single, global information space." In theory, his dream has come true. Anybody with an Internet connection can access the vast treasure of information available on the Web. In practice, though, many of those riches remain hidden from view. The Web universe is simply too big and too strange a place for simple search engines to navigate and make sense of.
Internet researchers used to believe that while a good search engine can rummage through as many as 1.5 billion Web pages, at least as many again were overlooked for various technical reasons. But a white paper last year from Internet search company BrightPlanet concluded that the invisible Web is actually about 500 times larger than that and almost certainly growing faster than the visible Web. Mercifully, help is at hand. The realization that we barely skim the surface and the value of what's out there is spurring the development of some remarkable technology that can dig deeper and smarter.
It's not that Google, HotBot and other popular search engines are bad - in fact they're better than ever and improving constantly - but the technology they employ is no match for the sheer speed and diversity of Web growth.
Search engines rely mostly on crawlers, software robots that hop from site to site and from one URL (uniform resource locator, or Web address) to another, indexing the contents of pages as they go. For most pages, crawlers do a fine, if slow, job. But when they bump into sites where information is held inside a database, they grind to a halt. Databases, which make up the biggest single component of the invisible Web, need to be specifically queried before they can generate the page of information you are looking for. Governments, public and private institutions, libraries, businesses, organizations and enthusiasts maintain databases as efficient tools for managing their information. Putting them on the Web makes them a tremendous resource for everyone.
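
To make the crawler's blind spot concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the example.org pages, the tiny catalog and the function names are hypothetical, and real crawlers are far more elaborate. It shows a link-following robot indexing every static page it can reach while never seeing a record that exists only as the answer to a query.

```python
from collections import deque

# Hypothetical static pages: URL -> (page text, outgoing links).
STATIC_PAGES = {
    "http://example.org/": (
        "Welcome to the bookshop",
        ["http://example.org/about", "http://example.org/search"],
    ),
    "http://example.org/about": ("About our shop", []),
    "http://example.org/search": ("Enter a title to search our catalog", []),
}

# Hypothetical catalog, reachable only by submitting a query.
CATALOG = {
    "the little prince": "The Little Prince, 1943 edition, in stock",
    "moby dick": "Moby Dick, 1930 illustrated edition, in stock",
}


def query_catalog(title):
    """Generate a result page on demand, as a database-backed site would."""
    return CATALOG.get(title.lower(), "No match found")


def crawl(start_url):
    """Breadth-first crawl: follow links and index each page's text."""
    index = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in index or url not in STATIC_PAGES:
            continue  # already indexed, or not a page the crawler can fetch
        text, links = STATIC_PAGES[url]
        index[url] = text
        queue.extend(links)
    return index


index = crawl("http://example.org/")

# The crawler indexed the search form itself, but none of the records behind it.
crawler_saw_book = any("Moby Dick" in text for text in index.values())

# Asking the database directly produces the page the crawler never saw.
direct_result = query_catalog("Moby Dick")
```

In this toy setup the crawler ends up with three pages, including the search form, yet the book turns up only when the catalog is queried by title.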
Say you want to find a fondly remembered, long-out-of-print book. A conventional search engine might turn up a few old references to it. But if you know about Advanced Book Exchange (www.abebooks.com), a 28-million-title catalog of the holdings of 8,000 booksellers worldwide, there's a good chance you will find it listed there and be able to buy it on the spot. "People have to realize that if they rely only on general search engines to find material, they're going to find it either not easily or not at all," says Gary Price, a freelance researcher.
To help in that quest, Price and search consultant Chris Sherman have written The Invisible Web (Information Today; 399 pages), a comprehensive attempt at mapping the invisible Web. It details how current search technology works or, more to the point, why it often doesn't: search engines are expensive and cumbersome to maintain, often taking four to six weeks to revisit and reindex a website. Even then they'll probably not burrow beyond the first level or two of data, especially on a large corporate or academic site. And a crawler is often stymied by complicated offerings like movies, sound files, images or Microsoft Word documents. Price's remedy: "Learn where to find these invisible resources and build your own collection of them." Thankfully and logically, the copious collection in his book is also to be found at www.invisible-web.net.
The search engines' limitations might not matter to those who still see the World Wide Web as a free-of-charge garden of delights - the very word browser, after all, implies idle curiosity. But for information-dependent businesses, the reality is different. "Companies have spent billions of dollars on intranet infrastructures, knowledge management systems and customer relationship management systems, and the best return on investment they've had so far is e-mail," says Mahendra Vora, CEO of Intelliseek, one of several new companies aiming to unlock the potential of the invisible Web for their customers.
Launched in Cincinnati in 1997, the firm (www.intelliseek.com) began by providing deep search resources for individual researchers, but its real targets are the intranets of global corporations. Among its biggest clients are Goldman Sachs and Procter & Gamble, as well as Nokia and Ford, which - along with In-Q-Tel, the high-tech investment arm of the Central Intelligence Agency - put up much of the $9.4 million in venture capital Intelliseek has received in recent weeks.
Companies like Intelliseek, BrightPlanet and Moreover (see box) are part of a business intelligence technology market that will grow, according to the technology research firm IDC, from $3.6 billion this year to $11.9 billion in 2005. They are not necessarily a threat to traditional data-peddlers, such as Dialog and Lexis-Nexis, which have been delivering information to businesses since before the World Wide Web was invented and have archives stretching back decades. But their focus on the flickering, free or low-cost information of the here and now is something the old guard will have to respond to.
Intelliseek's software can be set up to monitor and query the databases of news sites, chatrooms and Usenet groups for trends, product information and gossip about your company and your competitors. "We identify the best sources for a topic, company or individual, then mine the information automatically, aggregate it, filter it, clean it, index it, relevance-rank it, auto-categorize it and move it into the matrix," says Vora. Often the most useful information is already sitting on a company's own network. E-mail from customers and clients can be a goldmine if it's harvested and made searchable. Vora cites - but won't name - a global multimillion-dollar company that has 700 Lotus Notes databases on its network. Because they are not searchable, employees there have no idea that much of what they need is already on their network. "The problem is so serious they are ashamed of it," he says.
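
What "made searchable" can mean in practice is easiest to see with an inverted index, the basic structure behind most text search. The Python sketch below is an illustration under stated assumptions, not a description of Intelliseek's software: the sample messages, their IDs and the function names are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return sorted IDs of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical customer e-mails sitting unsearched on a company network.
messages = {
    "msg-1": "Customer reports the new phone battery drains quickly",
    "msg-2": "Client asks about delivery dates for the spring catalog",
    "msg-3": "Battery complaint: phone shuts down after an hour",
}

index = build_index(messages)
results = search(index, "phone battery")  # msg-1 and msg-3 mention both words
```

Once the index exists, a query touches only the words it names rather than every message in the archive, which is why harvesting comes first and searching second.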
Medium to large companies can expect to pay between $100,000 and $300,000 a year for Intelliseek's services. Individual searchers can exploit some of the same expertise for free at www.profusion.com, where handpicked collections of resources are grouped and searchable by subject. More specialized and tightly focused search tools are the kind of solutions to the invisible Web's sprawl you can expect to see more of, says Barbara Quint, editor of Searcher, a journal for database professionals. "What you get are high-quality sites, preselected directories and metadata [data about data] collections. They may be a minuscule proportion of what's on the Web, but hopefully they're the good stuff."
As for the general search engines, don't write them off just yet. AltaVista last month unveiled search and retrieval software that can handle more than 200 different file formats on company intranets. Over the past few weeks, Google has begun indexing text held in Adobe's popular Portable Document Format (PDF) and has added five years' worth of postings on the Usenet discussion group network, plus a five-language webpage translation service and a search facility for more than 150 million images. The San Francisco company says it plans to float a share offering before the end of the year, though with a relatively modest, post-dotcom-shakeout price tag of $250 million.
At the outer limits of the deep Web, even non-text media are beginning to sway to the algorithms and analytical software of Net technology. Scientists from the Norwegian company FAST are showcasing a search engine, at www.alltheweb.com, that can handle sound files, images and movies. Virage's technology for encoding, indexing and publishing streaming media like audio and video broadcasts is being used at www.westminsterlive.tv to link the text of proceedings in Britain's Houses of Parliament to Web broadcasts of them. And at www.speechbot.com, Compaq's experimental voice-recognition software is transcribing Web TV and radio programs automatically. "Online is the preferred environment for almost everybody at this point," says Quint. "It's the birth of the universal library, in a sense." Many years of construction remain before it can be inaugurated, but as the invisible Web swims into focus, we begin to glimpse the awesome scale of Tim Berners-Lee's vision.